Smart exploration in reinforcement learning using absolute temporal difference errors

Authors

  • Clement Gehring
  • Doina Precup
Abstract

Exploration remains one of the crucial problems in reinforcement learning, especially for agents acting in safety-critical situations. We propose a new directed exploration method, based on a notion of state controllability. Intuitively, if an agent wants to stay safe, it should seek out states where the effects of its actions are easier to predict; we call such states more controllable. Our main contribution is a new notion of controllability, computed directly from temporal-difference errors. Unlike other existing approaches of this type, our method scales linearly with the number of state features and is directly applicable to function approximation. Our method converges to correct values in the policy evaluation setting. We also demonstrate significantly faster learning when this exploration strategy is used in large control problems.
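As a rough illustration of the idea the abstract describes, here is a minimal sketch assuming linear function approximation over state features phi(s). The class name, step sizes, and the exact form of the bonus are illustrative assumptions, not the authors' algorithm: a linear TD(0) learner additionally fits a linear estimate of the absolute TD error over the same features, and the negative of that estimate serves as an exploration signal favoring more predictable ("controllable") states.

```python
import numpy as np

# Hedged sketch (not the authors' exact algorithm): a linear TD(0) learner
# that also tracks a linear estimate of the absolute TD error per state.
# States with a low predicted |TD error| are treated as more "controllable",
# and the negative estimate is used as an exploration bonus. The names
# alpha, beta, and bonus_scale are illustrative assumptions.

class ControllabilityTD:
    def __init__(self, num_features, alpha=0.1, beta=0.1, gamma=0.99):
        self.w = np.zeros(num_features)   # value-function weights
        self.u = np.zeros(num_features)   # |TD error| estimate weights
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def update(self, phi, reward, phi_next):
        # Standard linear TD(0) update for the value function.
        delta = reward + self.gamma * self.w @ phi_next - self.w @ phi
        self.w += self.alpha * delta * phi
        # Track |delta| with a second linear learner: this is the
        # controllability signal (smaller = more predictable). Like the
        # value update, this costs O(num_features) per step.
        err = abs(delta) - self.u @ phi
        self.u += self.beta * err * phi
        return delta

    def exploration_bonus(self, phi, bonus_scale=1.0):
        # Prefer states whose action effects are easier to predict,
        # i.e. penalize a high predicted |TD error|.
        return -bonus_scale * (self.u @ phi)
```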


Related articles

Consistent exploration improves convergence of reinforcement learning on POMDPs

This paper sets out the concept of consistent exploration of observation-action pairs. We present a new temporal difference algorithm, CEQ(λ), based on this concept and demonstrate using a randomly generated set of partially observable Markov decision processes (POMDPs) that it outperforms SARSA(λ). This result should generalise to any POMDP where satisficing policies which map observations to ...
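CEQ(λ) itself is not specified in this excerpt; as context, here is a hedged sketch of the SARSA(λ) baseline it is compared against, tabular with replacing eligibility traces. All hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the SARSA(lambda) baseline named in the excerpt.
# Q and E are (num_states, num_actions) arrays; E holds eligibility traces.

def sarsa_lambda_update(Q, E, s, a, r, s2, a2, alpha=0.1, gamma=0.95, lam=0.9):
    # TD error for the on-policy transition (s, a, r, s2, a2).
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    E[s, a] = 1.0                      # replacing eligibility trace
    Q += alpha * delta * E             # credit all recently visited pairs
    E *= gamma * lam                   # decay traces toward zero
    return Q, E
```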


Reducing Commitment to Tasks with Off-Policy Hierarchical Reinforcement Learning

In experimenting with off-policy temporal difference (TD) methods in hierarchical reinforcement learning (HRL) systems, we have observed unwanted on-policy learning under reproducible conditions. Here we present modifications to several TD methods that prevent unintentional on-policy learning from occurring. These modifications create a tension between exploration and learning. Traditional TD m...
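The excerpt hinges on the distinction between off-policy and on-policy TD targets. As a hedged illustration of that distinction only (not the paper's modifications), here `Q` is assumed to be a dict mapping (state, action) pairs to floats:

```python
# Off-policy TD bootstraps from the greedy action; on-policy TD bootstraps
# from the action the exploring policy actually takes. An HRL system that
# accidentally uses the executed action's value drifts into on-policy
# learning, which is the failure mode the excerpt describes.

def q_learning_target(Q, reward, s_next, actions, gamma=0.99):
    # Off-policy: evaluate the greedy action, regardless of what is executed.
    return reward + gamma * max(Q[(s_next, a)] for a in actions)

def sarsa_target(Q, reward, s_next, a_next, gamma=0.99):
    # On-policy: evaluate the action the behaviour policy actually chose.
    return reward + gamma * Q[(s_next, a_next)]
```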


Resource Constrained Exploration in Reinforcement Learning

This paper examines temporal difference reinforcement learning (RL) with adaptive and directed exploration for resource-limited missions. The scenario considered is an energy-limited agent that must explore an unknown region to find new energy sources. The presented algorithm uses a Gaussian Process (GP) regression model to estimate the value function in an RL framework. However, to avoid ...
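As a hedged sketch of the GP component the excerpt mentions: GP regression over visited states yields a posterior mean (the value estimate) and a posterior variance that can drive directed exploration toward uncertain regions. The RBF kernel and all hyperparameters here are assumptions, not the paper's choices.

```python
import numpy as np

# Minimal GP regression sketch for value estimation from visited states.

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between two sets of state vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_value_estimate(X_visited, returns, X_query, noise=1e-2):
    # Posterior mean and variance of the value at the query states.
    K = rbf_kernel(X_visited, X_visited) + noise * np.eye(len(X_visited))
    K_s = rbf_kernel(X_query, X_visited)
    mean = K_s @ np.linalg.solve(K, returns)
    var = rbf_kernel(X_query, X_query).diagonal() - np.einsum(
        "ij,ij->i", K_s, np.linalg.solve(K, K_s.T).T)
    return mean, var
```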


Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences

This paper presents “Value-Difference Based Exploration” (VDBE), a method for balancing the exploration/exploitation dilemma inherent to reinforcement learning. The proposed method adapts the exploration parameter of ε-greedy based on the temporal-difference error observed from value-function backups, which is treated as a measure of the agent’s uncertainty about the environment. VDB...
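Since the excerpt describes the core mechanism, a minimal sketch may help: ε is pulled toward a bounded function of the absolute TD error, so uncertain states stay exploratory while well-learned states become greedy. The sensitivity sigma and the mixing weight follow the spirit of VDBE; the exact constants here are assumptions.

```python
import math

# Hedged sketch of the VDBE idea: nudge a state's epsilon toward a
# bounded function of |TD error| after each value backup.

def vdbe_update(eps, td_error, sigma=1.0, mix=0.1):
    # f in (0, 1): near 1 for large |TD error|, near 0 as learning settles.
    x = math.exp(-abs(td_error) / sigma)
    f = (1.0 - x) / (1.0 + x)
    # Exponential moving average pulls epsilon toward f.
    return mix * f + (1.0 - mix) * eps
```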


Using Reinforcement Learning to Make Smart Energy Storage Source in Microgrid

The use of renewable energy in power generation, together with sudden load changes and faults in power transmission lines, may cause voltage drops and challenge the reliability of the system. One way to compensate for the fluctuating nature of renewable energy in the short term, without disconnecting loads or starting additional plants, is to use energy storage. The use of ener...



Publication date: 2013